1. DETAILED SUMMARY OF PROGRESS

The objective of this research was to develop new, cost-effective techniques for fault tolerance in multicomputer
architectures. The requirements for high performance and fault tolerance are seemingly contradictory: parallel
architectures and algorithms developed for high performance attempt to achieve maximum utilization of each of
the processors, while fault tolerance requires redundant computations and checks to ensure that the results of the
computations are correct. The result is that conventional fault tolerance techniques are very expensive when
applied to highly parallel multicomputer architectures. Our unique approach to achieve fault tolerance in
multicomputer parallel architectures is to use an algorithm-based fault tolerance (ABFT) technique, which is an
on-line system-level method for detection of faults, followed by a system-level approach to reconfiguration and
recovery of a parallel processor system.

In the past three and a half years, this research pursued four major topics: (1) an evaluation of
algorithm-based fault tolerance on general purpose multiprocessors in the presence of round-off errors; (2)
development of novel algorithm-based fault tolerance techniques for solutions of partial differential equations;
(3) development of compiler-assisted automated synthesis of algorithm-based fault detection techniques; and (4)
investigation of system-level reconfiguration and recovery techniques. In the following we give some details of
the research results obtained in the previous year.

1.1. Evaluation of algorithm-based fault tolerance in the presence of round-off errors

Algorithm-based techniques are based on checking for the preservation of certain properties possessed by
global data following a set of computations. This often involves the introduction of a check variable which is
updated in such a manner that, in the absence of roundoff errors, it equals the value of some function which
involves all the data elements participating in the algorithm. Usually, the function chosen is a sum over all data
elements, known as a checksum, since this is both easy to compute and leads to easy determination of the rules for
updating the check variables. However, the fact that roundoff errors accumulate in different ways in the updates
involving the check variables and the computations involving data elements makes it highly unlikely that the
equality is preserved exactly for an implementation of the algorithm on a real computer. Thus, the check step
involves verifying the preservation of the equality to within a tolerance value. So far, two approaches have been
taken for the determination of the tolerance. One is an experimental method which requires the data sets involved
to be of fixed size and range to be effective. Another more recent approach has suggested extracting the mantissas
of the floating point quantities involved and applying an exact integer checksum test to products involving
mantissas prior to floating point summation. However, errors in the floating point summations cannot be easily
checked. In this research, we proposed a method for determination of the tolerance based on error analysis
techniques. In the interest of rapid derivation of the tolerance expressions, some simplifications are introduced
in the analysis process which, however, do not lessen the effectiveness of the tolerance expression derived. We
present results on three numerical algorithms which show the effectiveness of our approach for data sets of
varying sizes and data ranges.

We have applied this technique to three parallel applications on an Intel iPSC hypercube multiprocessor, namely
matrix multiplication, Gaussian elimination and QR factorization. Error coverages of about 90-100% were
observed over a wide range of data sizes and data ranges.
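
As a rough illustration of the check step described in this section, the following Python sketch applies a
column-checksum test to a matrix product and accepts the result only if the two sides agree to within a
tolerance. The tolerance formula is a simple placeholder chosen for illustration (machine epsilon scaled by an
operation count and by the magnitude of the data); it is not the error-analysis expression derived in this
research. The function names and the `safety` parameter are hypothetical.

```python
import sys

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def column_sums(m):
    """Sum of each column of a matrix stored as a list of rows."""
    return [sum(col) for col in zip(*m)]

def verify(c, a, b, safety=4.0):
    """Return one pass/fail flag per column of the claimed product c = a * b."""
    rows, inner = len(a), len(b)
    # Check row: (column sums of a) * b, computed without touching c.
    check_row = matmul([column_sums(a)], b)[0]
    # Placeholder tolerance (an assumption, not the report's expression):
    # epsilon * operation count * magnitude of the data feeding each column.
    abs_a = [[abs(x) for x in row] for row in a]
    abs_b = [[abs(x) for x in row] for row in b]
    scale = matmul([column_sums(abs_a)], abs_b)[0]
    eps = sys.float_info.epsilon
    actual = column_sums(c)
    return [abs(actual[j] - check_row[j]) <= safety * (rows + inner) * eps * scale[j]
            for j in range(len(check_row))]
```

A caller would compute `c = matmul(a, b)` and then inspect `verify(c, a, b)`; a False entry flags the column in
which the checksum equality failed. Scaling the tolerance by the data magnitude is what lets a single expression
cover data sets of different sizes and ranges, which a fixed experimental threshold cannot do.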

1.2.  Investigation  of  Novel  Schemes  of  Algorithm-based  fault  tolerance 

Algorithm-based schemes have been proposed for a wide variety of numerical applications by us and other
researchers in the past. Applications that were developed in the past include matrix multiplication, Fast Fourier
Transform, Gaussian elimination, QR factorization, filtering, etc. Those schemes were all based on either the
checksum encoding or the sum-of-squares encoding.

However,  for  a  particular  class  of  numerical  applications,  namely  those  involving  the  iterative  solution  of 
linear  systems,  there  exist  almost  no  fault-tolerant  algorithms  in  the  literature.  In  this  research,  we  describe  a 
fault-tolerant  version  of  a  parallel  algorithm  for  iteratively  solving  the  Laplace  equation  over  a  grid.  The  fault- 
tolerant  algorithm  is  based  on  the  popular  successive  overrelaxation  scheme  with  red-black  ordering.  We  use  the 
Laplace  equation  merely  as  an  illustration;  fault-tolerant  versions  of  other  iterative  schemes  for  solution  of  linear 
systems arising from discretizations of other partial differential equations may be similarly derived.

We  also  present  a  new  way  of  dealing  with  the  roundoff  errors  which  complicate  the  check  phase  of 
algorithm-based  schemes.  Our  approach  is  based  on  error  analysis  incorporating  some  simplifications  and  gives 
high  fault  coverage  and  no  false  alarms  for  a  large  variety  of  data  sets,  as  shown  by  our  results. 
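
The summary above does not spell out the check that is used, so the following Python sketch is only one
plausible form of it, under stated assumptions. It implements one half-sweep of red-black successive
overrelaxation for the Laplace equation on a square grid and guards it with a weighted checksum: because the
update is linear, the sum of the freshly updated color can be predicted from the pre-sweep data alone. The
function names, the weighting scheme and the fixed relative tolerance `rel_tol` are all assumptions made for
illustration; in particular, `rel_tol` stands in for the error-analysis tolerance described in the text.

```python
def half_sweep(u, parity, omega):
    """Relax every interior point with (i + j) % 2 == parity, in place."""
    n = len(u) - 2
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if (i + j) % 2 == parity:
                nb = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                u[i][j] = (1 - omega) * u[i][j] + 0.25 * omega * nb

def color_sum(u, parity):
    """Checksum: sum of all interior points of one color."""
    n = len(u) - 2
    return sum(u[i][j] for i in range(1, n + 1) for j in range(1, n + 1)
               if (i + j) % 2 == parity)

def predicted_sum(u, parity, omega):
    """Checksum the updated color should have, from pre-sweep data only."""
    n = len(u) - 2
    weighted = 0.0
    for i in range(n + 2):
        for j in range(n + 2):
            if (i + j) % 2 != parity:
                # Weight = number of interior neighbors that will be updated.
                w = sum(1 for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 1 <= i + di <= n and 1 <= j + dj <= n)
                weighted += w * u[i][j]
    return (1 - omega) * color_sum(u, parity) + 0.25 * omega * weighted

def checked_half_sweep(u, parity, omega, rel_tol=1e-9):
    """Run one half-sweep; return True if the checksum test passes."""
    expected = predicted_sum(u, parity, omega)
    half_sweep(u, parity, omega)
    scale = sum(abs(x) for row in u for x in row)
    return abs(color_sum(u, parity) - expected) <= rel_tol * max(scale, 1.0)
```

Here `u` is a list of rows that includes the boundary values, and one full iteration calls
`checked_half_sweep(u, 0, omega)` followed by `checked_half_sweep(u, 1, omega)`. The prediction costs a single
pass over the grid per half-sweep, so in this sketch the check is of the same order as the sweep itself; a
practical scheme would check less often or maintain the weighted sums incrementally.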

The timing overheads of our fault-tolerant algorithm over the basic SOR algorithm involving no fault tolerance
decrease with increasing problem dimension and become negligible for large data sizes.


1.3. Compiler Assisted Synthesis of ABFT techniques

All  the  algorithm-based  fault  detection  techniques  proposed  in  the  past  have  been  application  and  algorithm 
specific;  they  had  been  designed  by  the  user  by  exploiting  some  particular  features  of  the  algorithms.  If  one 
wishes  to  apply  these  techniques  to  a  large  set  of  useful  parallel  applications  on  real  parallel  machines,  each  of 
these  parallel  applications  will  have  to  be  rewritten  with  ABFT  checks.  Such  a  procedure  would  be  clearly  quite 
tedious  for  the  user  and  therefore  not  very  useful.  In  this  research,  we  are  therefore  exploring  the  possibility  of 
automating  the  process  of  designing  the  ABFT  checks  on  existing  application  programs. 


We have proposed a theoretical basis for synthesizing algorithm-based checking techniques for general
applications at the compiler level. The basic approach is to identify linear transformations in Fortran DO loops,
restructure program statements to convert non-linear transformations to linear ones, and insert system-level
checks based on the linearity property. In previous years, we had developed the framework of such a compiler.

We have implemented a source-to-source restructuring compiler, CRAFT (CompileR assisted Algorithm-based
Fault Tolerance), for the synthesis of low-cost system-level checks for general numerical Fortran programs,
based on the above approach. We have used Parafrase-2, an existing source-to-source vectorizing package, to aid
us in the CRAFT project.

The  LINTEST  pass  of  our  compiler  performs  automatic  detection  of  linear  statements  in  a  given  program 
followed  by  linearization  of  other  statements.  LINTEST  is  executed  in  two  phases  or  sub-passes.  The  first  of 
these  is  the  GETLIN  phase,  which  goes  through  the  entire  program  and  identifies  symbolically  linear  statements. 
The  second  phase  uses  the  above  information  and  calls  one  of  the  following  two  procedures,  depending  on  the 
nature  of  the  statement  currently  being  processed:  (1)  the  PROGLIN  procedure;  (2)  the  MAKELIN  procedure. 
The  GETLIN  phase  performs  step-by-step  detection  of  symbolic  linearity  for  all  assignment  statements  (inside 
loops).  The  ADDCHECK  pass  of  the  CRAFT  compiler  operates  on  the  linearized  version  of  the  original  program 
and  generates  suitable  check  code  for  providing  on-line  detection  of  errors  that  may  occur  during  actual  execution 
of  the  program. 

We have applied compiler-assisted synthesis of algorithm-based checks to several real scientific
FORTRAN programs. Sample routines from the LINPACK library (namely DGEFA and DGESL) and the
EISPACK library (namely TRED2 and TQL2), and two programs from the Perfect Club Benchmark Suite (namely
TRFD and MDG and their subroutines), were considered. For each of those applications, we have obtained
detailed experimental results.
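
To give a feel for what a linearity-based check looks like once it has been inserted, the following Python
sketch guards a simple loop whose body is linear in its inputs. It is written in Python rather than Fortran and
is not output of the CRAFT compiler; the loop, the function names and the fixed relative tolerance are
hypothetical and chosen only for illustration.

```python
def linear_loop(alpha, x, beta, z):
    """Original computation: y(i) = alpha*x(i) + beta*z(i), linear in x and z."""
    return [alpha * xi + beta * zi for xi, zi in zip(x, z)]

def linearity_check(y, alpha, x, beta, z, rel_tol=1e-12):
    """Inserted check: the sum of the outputs must equal the same linear
    expression applied to the sums of the inputs, up to roundoff."""
    expected = alpha * sum(x) + beta * sum(z)
    # Placeholder tolerance (an assumption): relative to the data magnitude.
    scale = (abs(alpha) * sum(abs(v) for v in x)
             + abs(beta) * sum(abs(v) for v in z))
    return abs(sum(y) - expected) <= rel_tol * max(scale, 1.0)
```

The check needs one extra pass over the inputs and one over the outputs, which is cheap next to a loop body
with more arithmetic per element. A statement that is not linear as written, for example one that multiplies
two loop-varying quantities, has to be restructured before a check of this kind applies.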

The main thrust of our compiler-assisted approach has been to produce algorithm-based checks; hence our
compiler only inserts checks into an already existing parallel program. To make the process more powerful, one can
use such a compiler as a back end to a parallelizing compiler. Parallelizing compilers for the distributed
memory machines to which our algorithm-based checking methods apply are still quite immature, and unfortunately
no mature compiler exists that we could use. Hence an interesting research effort that we started this year
was to adapt some of the recent advances in distributed memory compilers and implement those ideas in our
PARADIGM compiler.

The PARADIGM compiler project at Illinois provides an automated means to convert sequential programs
for execution on distributed memory multicomputers. This is accomplished in two steps. The first step involves
the automatic parallelization of the sequential program into a shared memory parallel program using traditional
compiler dependence analysis. The second step involves the data partitioning, computation partitioning, and
communication generation steps for the parallel program for efficient execution on distributed memory machines.
The compiler is targeted to structured parallel numerical applications written in FORTRAN 77 that operate on
regular data structures such as matrices. The sequential FORTRAN programs for regular applications are
automatically parallelized using a parallelizing compiler (we use Parafrase-2), and the PARADIGM compiler performs
several compiler transformations on the program, and then generates efficient message passing FORTRAN code. The
basic framework of the compiler is similar to the FORTRAN D compiler from Rice University. Numerous compiler
optimizations such as loop bounds reduction, message vectorization, message chaining, message aggregation
and pipelining, are automatically performed by the PARADIGM compiler.

In addition, the PARADIGM compiler is unique in its ability (1) to perform automated data distribution for
regular data structures, something that conventionally has to be specified as a compiler directive in FORTRAN D
and High Performance FORTRAN; (2) to generate high-level communication primitives, such as EXPRESS
library calls; (3) to exploit functional and data parallelism simultaneously; and (4) to generate multithreaded
message-driven code (as opposed to conventional SPMD message send-receive code) that hides the long latencies of
communication by overlapping computation with communication. We have already implemented such a compiler,
which generates code for machines such as the Intel iPSC/860, the Intel Paragon, and the Connection Machine
CM-5.


1.4.  System  Level  Reconfiguration  and  Recovery  Techniques 

Once  a  faulty  processor  has  been  detected  using  our  algorithm  based  scheme,  one  needs  to  investigate 
schemes  for  reconfiguring  the  system  around  the  faulty  processors.  We  have  proposed  two  hardware  approaches 
and  a  software  approach  for  reconfiguring  hypercube  and  mesh  based  multiprocessors. 

In the first hardware scheme for reconfiguration, we assume that spare processors can be attached to specific
processors of the hypercube or mesh. In this approach, we designate two types of nodes (P-nodes and S-nodes) in
the hypercube or mesh. P-nodes correspond to conventional processing nodes in distributed memory systems.
S-nodes correspond to nodes that have a spare processor and memory attached to the node. The message routing
logic is shared between the two processing units. If we assign one spare processor to a set of active processors,
then the hardware overhead can be minimized. We have developed appropriate embedding algorithms for hypercube
and mesh topologies.

The second hardware approach places spare processors along specific links in the hypercube or mesh.
Hence the node hardware is not changed at all. We have investigated cost-effective ways to place spare
processors on selected links of a hypercube and mesh, together with the appropriate embeddings. We have again
formally developed reconfiguration algorithms for this framework.

Both the hardware schemes involve mapping of logical links of a virtual machine onto a set of physical
links in the final reconfigured machine and hence suffer some performance degradation. We have analyzed the
performance degradation theoretically and experimentally. Our theoretical studies have led to bounds on the
dilation of the logical links in both the schemes. We have performed simulation studies to verify these bounds.
The simulations have been performed on several large parallel applications running on an Intel iPSC/2 hypercube.

A third scheme for reconfiguration that we have proposed in this research is a software
reconfiguration strategy for hypercube multicomputer architectures under multiple faults. The advantage of this
reconfiguration strategy over previous reconfiguration schemes is that it requires no redundant hardware, but
supports reconfiguration through graceful degradation. It is based on the notion of using multiple virtual
processors on a single physical processor and using these virtual processors for workload redistribution under
faults. We have implemented a software reconfiguration scheme based on this notion of virtual processors. We have
performed a case study of the performance degradation of the reconfiguration scheme on a commercially available
Intel iPSC/2 hypercube multicomputer for several actual parallel applications.
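
The idea of redistributing virtual processors under faults can be sketched as follows. This Python fragment is
a minimal illustration, not the scheme that was implemented: the placement rule (move each displaced virtual
processor to the least-loaded healthy hypercube neighbor of the failed node, falling back to any healthy node)
and all names are assumptions made for the example.

```python
from collections import Counter

def initial_mapping(dim, vps_per_node):
    """Map virtual processors onto the 2**dim physical nodes, in blocks."""
    return {v: v // vps_per_node for v in range((1 << dim) * vps_per_node)}

def reconfigure(mapping, dim, failed, faulty=frozenset()):
    """Return (new_mapping, faulty_set) after physical node `failed` is lost."""
    faulty = set(faulty) | {failed}
    load = Counter(mapping.values())
    # Hypercube neighbors differ from the failed node in exactly one bit.
    targets = [failed ^ (1 << b) for b in range(dim)
               if failed ^ (1 << b) not in faulty]
    if not targets:
        targets = [p for p in range(1 << dim) if p not in faulty]
    if not targets:
        raise RuntimeError("no healthy node left")
    new_mapping = dict(mapping)
    for vp, node in mapping.items():
        if node == failed:
            # Assumed policy: pick the least-loaded candidate, ties by node id.
            best = min(targets, key=lambda q: (load[q], q))
            new_mapping[vp] = best
            load[best] += 1
    return new_mapping, faulty
```

Starting from `initial_mapping(3, 2)`, a call such as `reconfigure(mapping, 3, failed=5)` leaves every virtual
processor mapped and none on node 5; the surviving nodes simply carry more virtual processors, which is where
the graceful degradation in performance comes from. Restricting targets to hypercube neighbors keeps displaced
work one hop from its old location, at the cost of less even load than a global redistribution.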

Finally, we have investigated a dynamic software reconfiguration and recovery scheme using the
object-oriented actor model of computation. The actor programming model involves a message-driven style of
execution, where each task represents an amount of work that is carried on until completion. At the end of the
execution of the current actor, messages are sent to other actors in an asynchronous manner. Each processor sits
in a pick-next-actor loop executing messages.

We have used the above parallel programming environment to develop a system-level reconfiguration
scheme. For every actor task in the original system, we create a shadow actor task which is automatically
placed on a different processor. Any messages communicated between original actor tasks are also sent to the
shadow actors. The shadow actors are passive replicas of the original tasks, and do not execute at any time
under normal conditions. Under a fault, all actors currently executing on the faulty processor are killed, and
their shadows on different processors are initiated to continue the work. We have implemented such a system using
the CHARM parallel programming system, and have evaluated the overheads of our scheme on a set of complex
parallel applications.
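
The shadow mechanism can be illustrated with a toy single-process simulation. The Python sketch below is not
CHARM code and makes several simplifying assumptions that go beyond the description above: tasks are stateless
functions, the shadow copy of a message lives on the next live processor after its home, a shadow copy is
discarded once the original task has run to completion, and only a single fault is tolerated. All class and
method names are hypothetical.

```python
from collections import deque

class Runtime:
    """Toy message-driven runtime with passive shadow copies."""

    def __init__(self, n_procs):
        self.n = n_procs
        self.alive = set(range(n_procs))
        self.queues = {p: deque() for p in range(n_procs)}  # active messages
        self.shadows = {p: {} for p in range(n_procs)}      # passive copies
        self.results = {}
        self._next_id = 0

    def send(self, task, payload, home):
        """Queue a message on `home` and a passive copy on another processor."""
        others = [(home + k) % self.n for k in range(1, self.n)]
        live = [p for p in others if p in self.alive]
        holder = live[0] if live else None   # assumed shadow placement rule
        msg_id = self._next_id
        self._next_id += 1
        self.queues[home].append((msg_id, task, payload, holder))
        if holder is not None:
            self.shadows[holder][msg_id] = (task, payload, home)
        return msg_id

    def step(self, proc):
        """One turn of the pick-next-actor loop on `proc`; False if idle."""
        if proc not in self.alive or not self.queues[proc]:
            return False
        msg_id, task, payload, holder = self.queues[proc].popleft()
        self.results[msg_id] = task(payload)   # task runs to completion
        if holder in self.alive:
            self.shadows[holder].pop(msg_id, None)  # shadow no longer needed
        return True

    def fail(self, proc):
        """Kill `proc` and activate the shadows of the work it was holding."""
        self.alive.discard(proc)
        self.queues[proc].clear()
        self.shadows[proc].clear()
        for holder in self.alive:
            log = self.shadows[holder]
            lost = [m for m, (_, _, home) in log.items() if home == proc]
            for msg_id in lost:
                task, payload, _ = log.pop(msg_id)
                self.queues[holder].append((msg_id, task, payload, None))

    def run_until_idle(self):
        """Drive every live processor until no messages remain."""
        while any(self.step(p) for p in sorted(self.alive)):
            pass
```

In normal operation the only extra work is the duplicate message kept by `send`; the shadow copies never
execute. After `fail(p)`, the messages that were still pending on processor `p` reappear in the queues of the
processors holding their shadows, so `run_until_idle()` completes the same set of tasks as a fault-free run.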

4. TRANSITIONS AND INTERACTIONS

The  principal  investigator's  work  in  algorithm-based  fault  tolerance  has  been  picked  up  by  researchers  in 
General  Electric  (Dr.  A.  Chatterjee  and  Dr.  M.  d’Abreau).  GE  researchers  have  applied  the  fault  tolerance 
research  to  their  signal  processing  systems. 

The principal investigator's work on algorithm-based fault tolerance has been followed up by researchers in
other universities, namely, Stanford (Prof. E. McCluskey), Princeton (Prof. N. K. Jha), SUNY Buffalo (Prof. S. S.
Ravi), and University of Texas at Austin (Prof. J. Abraham).

The work on the PARADIGM compiler has been picked up by several universities and companies, namely
Rice Univ. (Prof. Kennedy), Syracuse Univ. (Prof. Fox and Prof. Choudhary), Ohio State (Prof. Sadayappan), and
Motorola (Dr. Natarajan). The work on the CRAFT back-end to the PARADIGM compiler has been picked up by
the Univ. of Pittsburgh (Prof. Melhem and Prof. Gupta).

5. SOFTWARE AND HARDWARE PROTOTYPES

Several software packages have been developed from this research over the past three years.

1. The CRAFT compiler has been developed. It uses the PARAFRASE parallelizing compiler as a front end
to parse sequential FORTRAN programs, analyzes the statements for linearity, and applies linearity-based checks
to the programs running on individual node processors of a distributed memory multiprocessor.

2. A parallelizing compiler for distributed memory machines called PARADIGM has been developed. The compiler
performs automated data partitioning, exploits function and data parallelism, and optimizes communication in
many ways.

3. An experimental software environment is being developed for analyzing actual fault and error coverage figures
for real parallel applications using error injection in various types of functional elements such as floating point
units, integer units, memories, communication paths, control units, etc. All this is done in the presence of finite
precision errors. The testbed on which this software runs is the Intel iPSC/2 hypercube.

4. Several software means of reconfiguration have been developed on an Intel iPSC/2 hypercube. One uses a static
reconfiguration strategy. The second uses a dynamic strategy.


